Bayesian Inference on Mixtures of Distributions * 
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Abstract 

This survey covers state-of-the-art Bayesian techniques for the estimation of mixtures. It comple- 



ments the earlier Marin et al. ( 2005 I by studying new types of distributions, the multinomial, latent 



class and t distributions. It also exhibits closed form solutions for Bayesian inference in some discrete 
setups. Lastly, it sheds a new light on the computation of Bayes factors via the approximation of 



Chib ( 1995 I 



1 Introduction 

Mixture models are fascinating objects in that, while based on elementary distributions, they offer a 
much wider range of modeling possibilities than their components. They also face both highly complex 
computational challenges and delicate inferential derivations. Many statistical advances have stemmed 
from their study, the most spectacular example being the EM algorithm. In this short review, we choose 



to focus solely on the Bayesian approach to those models (Robert and Casella 2004 1. Friihwirth-Schnatter 



(2006) provides a book-long and in-depth coverage of the Bayesian processing of mixtures, to which we 



refer the reader whose interest is woken by this short review, while [MacLachlan and Peel (20001 give a 
broader perspective. 

Without opening a new debate about the relevance of the Bayesian approach in general, we note that 



the Bayesian paradigm (see, e.g., Robert 20011 allows for probability statements to be made directly 
about the unknown parameters of a mixture model, and for prior or expert opinion to be included in 
the analysis. In addition, the latent structure that facilitates the description of a mixture model can be 
naturally aggregated with the unknown parameters (even though latent variables are not parameters) 
and a global posterior distribution can be used to draw inference about both aspects at once. 

This survey thus aims to introduce the reader to the construction, prior modelling, estimation and 
evaluation of mixture distributions within a Bayesian paradigm. Focus is on both Bayesian inference 
and computational techniques, with light shed on the implementation of the most common samplers. 
We also show that exact inference (with no Monte Carlo approximation) is achievable in some particular 
settings and this leads to an interesting benchmark for testing computational methods. 



*Kate Lee is a PhD candidate at the Queensland University of Technology, Jean-Michel Marin is a researcher at INRIA, 
Universite Paris Sud, and adjunct professor at Ecole Polytechnique, Kerrie Mengersen is professor at the Queensland 
University of Technology, and Christian P. Robert is professor in Universite Paris Dauphine and head of the Statistics 
Laboratory of CREST. 
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In Section [2] we introduce mixture models, including the missing data structure that originally 
appeared as an essential component of a Bayesian analysis, along with the precise derivation of the 
exact posterior distribution in the case of a mixture of Multinomial distributions. Section |3] points out 
the fundamental difficulty in conducting Bayesian inference with such objects, along with a discussion 
about prior modelling. Section |4] describes the appropriate MCMC algorithms that can be used for 
the approximation to the posterior distribution on mixture parameters, followed by an extension of this 
analysis in Section |5] to the case in which the number of components is unknown and may be derived 
from approximations to Bayes factors, including the technique of Chib ( 1995 1 and the robustification of 
( |2003| ). 



Berkhof et al. 



2 Finite mixtures 
2.1 Definition 

A mixture of distributions is defined as a convex combination 

,7 J 
^PjfA^) > ^Pj = ^^ Pj >0, J > 1 , 

of standard distributions fj. The pj's are called weights and are most often unknown. In most cases, the 
interest is in having the fj's parameterised, each with an unknown parameter 9j, leading to the generic 
parametric mixture model 

J 

J2p,f{x\e,). (1) 

The dominating measure for (jlj is arbitrary and therefore the nature of the mixture observations 
widely varies. For instance, if the dominating measure is the counting measure on the simplex of M™ 

I 1=1 J 

the /j's may be the product of £ independent Multinomial distributions, denoted "A4m{£; qji, qjm) = 
Qji, qjjYi)" , with TO modalities, and the resulting mixture 

7 

) (2) 

i=i 

is then a possible model for repeated observations taking place in Sm.i- Practical occurrences of such 
models are repeated observations of contingency tables. In situations when contingency tables tend 
to vary more than expected, a mixture of Multinomial distributions should be more appropriate than 
a single Multinomial distribution and it may also contribute to separation of the observed tables in 
homogeneous classes In the following, we note qj. = (qji, ■ ■ ■ , qjm)- 

Example 1. For J = 2, m = 4, pi = = -5, qi. = (.2, .5, .2, .1), ga- = (-3, .3, .1, .3) and £ ^ 20, we 
simulate n = 50 independent realisations from model That corresponds to simiulate some 2x2 
contingency tables whose total sum is equal to 20. Figure [l] gives the histograms for the four entries of 
the contingency tables. M 

Another case where mixtures of Multinomial distributions occur is the latent class model where d 
discrete variables are observed on each of n individuals (iMagi dson and Vermunt|2000 1 . The observations 
(1 < z < n) are Xj — (xn, . . . , Xid), with Xiy taking values within the to„ modalities of the w-th variable. 
The distribution of is then 

J d 
3=1 i=l 



2 



nn 



R 



n n 



Figure 1: For J = 2, pi = p2 = .5, qi. = (.2, .5, .2, .1), q2. — (.3, .3, .1, .3), 1 — 20 and n = 50 independent 
simulations: histograms of the m = 4 entries. 



so, strictly speaking, this is a mixture of products of Multinomials. The applications of this peculiar 
modelling are numerous: in medical studies, it can be used to associate several symptoms or pathologies; 
in genetics, it may indicate that the genes corresponding to the variables are not sufficient to explain the 
outcome under study and that an additional (unobserved) gene may be influential. Lastly, in marketing, 
variables may correspond to categories of products, modalities to brands, and components of the mixture 
to different consumer behaviours: identifying to which group a customer belongs may help in suggesting 
sales, as on Web-sale sites. 

Similarly, if the dominating measure is the counting measure on the set of the integers N, the /j's 
may be Poisson distributions 'P{Xj) {Xj > 0). We aim then to make inference about the parameters 
{pj,Xj) from a sequence {xi)i^i^,,,^n of integers. 

The dominating measure may as well be the Lebesgue measure on M, in which case the f{x\9ys may 
all be normal distributions or Student's t distributions (or even a mix of both), with 9 representing the 
unknown mean and variance, or the unknown mean and variance and degrees of freedom, respectively. 
Such a model is appropriate for datasets presenting multimodal or asymmetric features, like the aerosol 
dataset from ,Nilsson and Kulmala, (2006, ) presented below. 

Example 2. The estimation of particle size distribution is important in understanding the aerosol 
dynamics that govern aerosol formation, which is of interest in environmental and health modelling. 
One of the most important physical properties of aerosol particles is their size; the concentration of 
aerosol particles in terms of their size is referred to as the particle size distribution. 

The data studied by Nilsson and Kulmala ( 2006 1 and represented in Figure |2] is from Hyytiala, a 



measurement station in Southern Finland. It corresponds to a full day of measurement, taken at ten 
minute intervals. < 

While the definition ([ij of a mixture model is elementary, its simplicity does not extend to the 
derivation of either the maximum likelihood estimator (when it exists) or of Bayes estimators. In fact, 
if we take n iid observations x = {xi, . . . , x„) from ([l]), with parameters 

P = (Pi...,Pj) and 6 ^ (Oi, . . . ,9j) , 

the full computation of the posterior distribution and in particular the explicit representation of the 
corresponding posterior expectation involves the expansion of the likelihood 

n J 

L{e,p\^) = l[Y.p,f{x,\9,) (3) 
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Figure 2: Histogram of the aerosol diameter dataset, along with a normal (red) and a t (blue) modelling. 

into a sum of J" terms, with some exceptions (see, for example Section |3]). This is thus computationally 
too expensive to be used for more than a few observations. This fundamental computational difficulty 
in dealing with the models ([l]) explains why those models have often been at the forefront for applying 
new technologies (such as MCMC algorithms, see Section |4|. 

2.2 Missing data 

Mixtures of distributions are typical examples of latent variable (or missing data) models in that a sample 
xi, . . . ,Xn from (jlj can be seen as a collection of subsamples originating from each of the / (a;i|6'j)'s, 
when both the size and the origin of each subsample may be unknown. Thus, each of the x^'s in the 
sample is a priori distributed from any of the fj's with probabilities pj. Depending on the setting, 
the inferential goal behind this modeling may be to reconstitute the original homogeneous subsamples, 
sometimes called clusters, or to provide estimates of the parameters of the different components, or even 
to estimate the number of components. 

The missing data representation of a mixture distribution can be exploited as a technical device to 
facilitate (numerical) estimation. By a demarginalisation argument, it is always possible to associate to 
a random variable Xi from a mixture ([T|) a second (finite) random variable Zi such that 

Xi\z^^ z f{x\0,) , V{z^=j)^pj. (4) 

This auxiliary variable Zi identifies to which component the observation Xi belongs. Depending on the 
focus of inference, the z^'s may [or may not] be part of the quantities to be estimated. In any case, keeping 
in mind the availability of such variables helps into drawing inference about the "true" parameters. This 



is the technique behind the EM algorithm of Dempster et al. ( 1977 1 as well as the "data augmentation" 



algorithm of , Tanner and Wong, (1987j that started MCMC algorithms. 

2.3 The necessary but costly expansion of the hkeUhood 

As noted above, the likelihood function ([S]) involves J" terms when the n inner sums are expanded, that 
is, when all the possible values of the missing variables Zi are taken into account. While the likelihood 
at a given value {0,p) can be computed in 0{nJ) operations, the computational difficulty in using 
the expanded version of (|3| precludes analytic solutions via maximum likelihood or Bayesian inference. 
Considering n iid observations from model (jlJ, if 7r(0,p) denotes the prior distribution on (0,p), the 
posterior distribution is naturally given by 

7r(0,p|x)(x [\{Y.P,f{x^\eM ^(0,p). 
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It can therefore be computed in 0(nJ) operations up to the normahsing [marginal] constant, but, similar 
to the likelihood, it does not provide an intuitive distribution unless expanded. 

Relying on the auxiliary variables z = {zi, . . . , Zn) defined in Q, we take Z to be the set of all J" 
allocation vectors z. For a given vector (rii, . . . ,nj) of the simplex {ni + . . . + nj = n}, we define a 
subset of Z, 

{n n 
z : ^Il2.=i = ni,.. . = nj \ , 

'i=i 1=1 J 

that consists of all allocations z with the given allocation sizes (ni, . . . ,rij), relabelled by j G N when 
using for instance the lexicographical ordering on (ni,...,nj). The number of nonnegative integer 
solutions to the decomposition of n into k parts such that ni + . . . + nj = n is equal to ( |FellCT|[l970| 

+ j-r 



Thus, we have the partition Z = Uj^j^Zj. Although the total number of elements of Z is the typically 
unmanageable J", the number of partition sets is much more manageable since it is of order n'^~^/( J— 1)!. 



It is thus possible to envisage an exhaustive exploration of the Zj''s. (Casella et al. 2004 did take 
advantage of this decomposition to propose a more efficient important sampling approximation to the 
posterior distribution.) 

The posterior distribution can then be written as 



^(0,p|x)=^^c.(z)7r(0,p|x,z) , (5) 



where lo (z) represents the posterior probability of the given allocation z. (See Section 2.4 for a derivation 



of oj (z).) Note that with this representation, a Bayes estimator of (0,p) can be written as 

r 

^^c.(z)E-[0,p|x,z] . (6) 

1=1 -ieZi 

This decomposition makes a lot of sense from an inferential point of view: the Bayes posterior distribution 
simply considers each possible allocation z of the dataset, allocates a posterior probability lj (z) to this 
allocation, and then constructs a posterior distribution 7r(0,p|x,z) for the parameters conditional on 
this allocation. Unfortunately, the computational burden is of order 0( J"). This is even more frustrating 
when considering that the overwhelming majority of the posterior probabilities oj (z) will be close to zero 
for any sample. 

2.4 Exact posterior computation 

In a somewhat paradoxical twist, we now proceed to show that, in some very special cases, there exist 
exact derivations for the posterior distribution! This surprising phenomenon only takes place for discrete 
distributions under a particular choice of the component densities f{x\9i). In essence, the /(x|0i)'s must 
belong to the natural exponential families, i.e. 

f{x\e,) = h{x) exp {e, ■ R{x) - , 



to allow for sufficient statistics to be used. In this case, there exists a conjugate prior (Robert 2001) 
associated with each 9 in f{x\9) as well as for the weights of the mixture. Let us consider the complete 
likelihood 



L^(0,p|x,z) = exp{0,^ • R{x,) - ^{9,^)} 



i=l 
J 



J 

nP?exp{0, .5, -n,v|/(0^.)} , 
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where Sj — X]z =j^('^«)- 1^ is easily seen that we remain in an exponential family since there exist 
sufficient statistics with fixed dimension, (rii, . . . , nj, Si, ... , Sj). Using a Dirichlet prior 

T{ai + ... + aj) I „ 1 

on the vector of the weights {pi, . . . ,pj) defined on the simplex of M'' and (independent) conjugate priors 
on the 9j's, 

7r(0j) cxexp{0, •Tj-(5,^'(^,)} , 
the posterior associated with the complete likelihood L'^{9, p|x, z) is then of the same family as the prior: 

7r(0,p|x,z) cx 7r(0,p) x L'^(0,p|x,z) 



J 

Qt — 1 



Hp? ' exp{e,-T,^ 6,^(9,)} 

xp;^cxp{9,-S,-n,^{9,)} 

J 

the parameters of the prior get transformed from aj to aj + nj , from Tj to tj + Sj and from Sj to Sj + rij . 

If we now consider the observed likelihood (instead of the complete likelihood), it is the sum of the 
complete likelihoods over all possible configurations of the partition space of allocations, that is, a sum 
over J" terms, 

J 

J2l[p;'exp{9,-S,~n,^{9,)} . 

z j = l 

The associated posterior is then, up to a constant, 

E n^?^"^^' exp{0, . (r, + S,) - {n, + S,)^{9,)} 

Z J = l 

= E ^(2)7r(0,p|x,z) , 

Z 

where uj{z) is the normahsing constant that is missing in 

{[ exp {9, ■ {r, + S,) (n, + S,)^(9,)} . 

The weight a^(z) is therefore 

nf=i -^(fij + Oj ) 

a;(z) cx X K{t. + S*,-, 71,- + (5,) , 

r(E;.i{n,+a,}) 

if K{t, S) is the normalising constant of exp \9j ■ r — (5^'(0j)}, i.e. 



if (r, <5) = y exp {6ij • T - S^(9j)) A9^ . 



Unfortunately, except for very few cases, like Poisson and Multinomial mixtures, this sum does not 
simplify into a smaller number of terms because there exist no summary statistics. From a Bayesian 
point of view, the complexity of the model is therefore truly of magnitude 0(J"). 

We process here the cases of both the Poisson and Multinomial mixtures, noting that the former case 
was previously exhibited by Fearnhead (20051. 
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Example 3. Consider the case of a two component Poisson mixture, 



Xi, . . . ,Xn'^ pViXl) + (i - p)riX2) , 

with a uniform prior on p and exponential priors 8xp{ri) and £xp{t2) on Ai and A2, respectively. For 
such a model, Sj = Xl^i^j normalising constant is then equal to 

/oo 
expjAjT — 5\og{\j)} d\j 
-OQ 
POO 

= J AJ"^ cxp{-6\j} dXj = 6-^ T{t) . 

The corresponding posterior is (up to the overall normalisation of the weights) 

2 

— 9 Trf^jpixjz) 

2 

= E ^TTT)^ -(^,p|x,z) 

z ^ ^ 

Z J — 1 ^ ^ 

7r(0, p|x, z) corresponds to a 6(1 + n^, 1 + n — rij) (Beta distribution) on pj and to a G{Sj + l,Tj + rij) 
(Gamma distribution) on 5j, {j — 1,2). 

An important feature of this example is that the above sum does not involve all of the 2" terms, simply 
because the individual terms factorise in (ni, n2, Si, S2) that act like local sufficient statistics. Since n2 — 
n — ni and S2 —J2^i~^i' the posterior only requires as many distinct terms as there are distinct values 
of the pair (ni. Si) in the completed sample. For instance, if the sample is (0, 0, 0, 1, 2, 2, 4), the distinct 
values of the pair (m, Si) are (0, 0), (1, 0), (1, 1), (1, 2), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), 
. . . , (6, 5), (6, 7), (6, 8), (7, 9). Hence there are 41 distinct terms in the posterior, rather than 2^ = 256. M 

Let n = {ni, . . . ,nj) and S = (5*1, . . . , Sj). The problem of computing the number (or cardinal) 
/i„(n, S) of terms in the sum with an identical statistic (n, S) has been tackled by Fearnhead ( 2005[ ), who 
proposes a recurrent formula to compute /i„(n, S) in an efficient book-keeping technique, as expressed 
below for a k component mixture: 

Ifsj denotes the vector of length J made of zeros everywhere except at component j where it is equal 
to one, if 

n== (ni,...,nj), and n - = (ni, . . . , - 1, . . . , nj) , 

then 

J 

Hi{ej,R{xi)ej) ^ 1 , Vj G {1, . . . , J} , and ^„(n, S) ^ ^„_i(n - , S - i? (2:„) e^) . 



Example 4. Once the /i„(n, S)'s are all recursively computed, the posterior can be written as 

2 

J2 A*n(n, S) n ri.l S,\/{tj + n,)^^+' 7r(0, p|x, n, S) , 

(n,S) J = l 

up to a constant, and the sum only depends on the possible values of the "sufficient" statistic (n, S). 
This closed form expression allows for a straightforward representation of the marginals. For instance. 
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(n,A) 


J = 2 


J = 3 


J = 4 


(10, .1) 


11 


66 


286 


(10,1) 


52 


885 


8160 


(10,10) 


166 


7077 


120,908 


(20, .1) 


57 


231 


1771 


(20,1) 


260 


20,607 


566,512 


(20, 10) 


565 


100,713 




(30, .1) 


87 


4060 


81,000 


(30,1) 


520 


82,758 




(30, 10) 


1413 


637,020 





Table 1: Number of pairs (n, S) for simulated datasets from a Poisson 'P{X) and different numbers of 
components. (Missing terms are due to excessive computational or storage requirements.) 



up to a constant, the marginal in Ai is given by 

2 

n "^J-^r/i^j + ^^i)'"^'^^ i'^i + l)'^^+'A'5i exp{-(ni + l)Xi}/n,l 

z j = l 



2 



= Mn(n,S) l[n,\S,\/{T,+n,Y 

(n,S) j = l 

X (ni + nf'+^Xf' exp{-(ni + ri)Ai}/ni! . 

The marginal in A2 is 

2 

^Mn(n,S) l[n,\S,l/{r,+n,f^+' 

(n,S) ] = 1 

{n2 + T2f'+^Xl' exp{-(n2 + T2)A2}/n2! , 

again up to a constant. 

Another interesting outcome of this closed form representation is that marginal densities can also be 
computed in closed form. The marginal distribution of x is directly related to the unnormalised weights 
in that 

m(x) - }^ u;{z) ^ ^ firkin, S) -— — 

(n,S) ^''+'>- 

up to the product of factorials 1/xi! • • ■ Xnl (but this product is irrelevant in the computation of the 
Bayes factor). 

Now, even with this considerable reduction in the complexity of the posterior distribution, the number 
of terms in the posterior still explodes fast both with n and with the number of components J, as shown 
through a few simulated examples in Table [l] The computational pressure also increases with the range 
of the data, that is, for a given value of (J, n), the number of values of the sufficient statistics is much 
larger when the observations are larger, as shown for instance in the first three rows of Table [T] a 
simulated Poisson 'P{X) sample of size 10 is mostly made of O's when A = .1 but mostly takes different 
values when A = 10. The impact on the number of sufficient statistics can be easily assessed when J = 4. 
(Note that the simulated dataset corresponding to (n, A) = (10,. 1) in Table [T] happens to correspond 
to a simulated sample made only of O's, which explains the n + 1 = 11 values of the sufficient statistic 
{m.Si) = (ni,0) when J = 2.) 

Example 5. If we have n observations n, = (n^i, . . . , n,;™) from the A4ultinomial mixture 

pMm{di;qii,. . . ,qirn) + (1 - p)Mm{di; 921, ■ • ■,q2m) 
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where nn + • • • + riim = di and + • • • + qim = 921 + • • • + q_2m = 1, the conjugate priors on the qj^s 
are Dirichlet distributions, (j = 1,2) 

. . . , qjyn) ^ T^ia^i, . . . , a-,-„i) , 

and we use once again the uniform prior on p. (A default choice for the aj^^s is oij^ — 1/2.) Note that the 
dj's may differ from observation to observation, since they are irrelevant for the posterior distribution: 
given a partition z of the sample, the complete posterior is indeed 

2 2 m 

^^il-pT^WXlq^l^---^- >^X{X{qj:'\ 

j = lZi=j j = lv = l 

(where rij is the number of observations allocated to component j) up to a normalising constant that 
does not depend on z. < 

More generally, considering a Multinomial mixture with m components, 

J 

~ ^PjMm{di;qji, . . .,qjm) , 
the complete posterior is also directly available, as 

J J J m 



-1/2 

J = l j = iZi=j j=lv=l 



once more up to a normalising constant. 

Since the corresponding normalising constant of the Dirichlet distribution is 

n"Lir(ajt,) 



T{aji H h ajm) ' 

the overall weight of a given partition z is 

n"Lir(ait, + 5*1^) ^ U7=i^i(^2v + S2v) 
r(aii H h aim + 5*1.) r(a2i H l-Q:2m + S'2.) 

where Sji is the sum of the riji's for the observations i allocated to component j and 

Sji = ^ Uji and 5'^. = ^ S'j^ . 

Zi=j i 

Given that the posterior distribution only depends on those "sufficient" statistics Sij and n^, the same 
factorisation as in the Poisson case applies, namely we simply need to count the number of occurrences 
of a particular local sufficient statistic (ni, Sn, . . . , Sjm) and then sum over all values of this sufficient 
statistic. The book-keeping algorithm of Fearnhead (20051 applies. Note however that the number of 
different terms in the closed form expression is growing extremely fast with the number of observations, 
with the number of components and with the number k of modalities. 

Example 6. In the case of the latent class model, consider the simplest case of two variables with two 
modalities each, so observations are products of Bernoulli's, 

X ^ pB{qn)B{qi2) + (1 - p)B{q2i)Biq22) ■ 

We note that the corresponding statistical model is not identifiable beyond the usual label switching issue 
detained in Section [3TT] Indeed, there are only two dichotomous variables, four possible realizations for 
the x's, and five unknown parameters. We however take advantage of this artificial model to highlight 
the implementation of the above exact algorithm, which can then easily uncover the unidentifiability 
features of the posterior distribution. 
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The complete posterior distribution is the sum over all partitions of the terms 



nn 



"1/2 



'JV 



where Sj^ — ^ivj sufficient statistic is thus (ni, sn, S12, S21, S22), of order O^iv'). Using the 

benchmark data of Stouffer and Toby (19511, made of 216 sample points involving four binary variables 



related with a sociological questionnaire, we restricted ourselves to both first variables and 50 observations 
picked at random. A recursive algorithm that eliminated replicates gives the results that (a) there are 
5, 928 different values for the sufficient statistic and (b) the most common occurrence is the middle 
partition (26,6,11,5,10), with 7.16 x 10^^ replicas (out of 1.12 x 10^^ total partitions). The posterior 
weight of a given partition is 



r(ni + i)r(n - '^1 +i) -Q -Q r(sj 



l/2)r(n, 



1/2) 



r(n + 2) 



i=i v=i 



T{nj + 1) 



- n n ^(^J- + l/2)r(n, - + 1/2) /ni! (n - m)! (n + 1)! , 

multiplied by the number of occurrences. In this case, it is therefore possible to find exactly the most 
likely partitions, namely the one with ni — 11 and 712 — 39, sn = 11, S12 = 8, S21 — 0, S22 — 17, and 
the symmetric one, which both only occur once and which have a joint posterior probability of 0.018. It 
is also possible to eliminate all the partitions with very low probabilities in this example. -4 



3 Mixture inference 

Once again, the apparent simplicity of the mixture density should not be taken at face value for inferential 
purposes; since, for a sample of arbitrary size n from a mixture distribution ([T]), there always is a non-zero 
probability (1 — pj)" that the jth subsample is empty, the likelihood includes terms that do not bring 
any information about the parameters of the i-th component. 

3.1 Nonidentifiability, hence label switching 

A mixture model ([ij is senso stricto never identifiable since it is invariant under permutations of the 
indices of the components. Indeed, unless we introduce some restriction on the range of the ^^'s, we 
cannot distinguish component number 1 (i.e., 9i) from component number 2 (i.e., ^2) in the likelihood, 
because they are exchangeable. This apparently benign feature has consequences on both Bayesian 
inference and computational implementation. First, exchangeability implies that in a J component 
mixture, the number of modes is of order 0(J!). The highly multimodal posterior surface is therefore 
difficult to explore via standard Markov chain Monte Carlo techniques. Second, if an exchangeable prior 
is used on = (6*1, ... , 9j), all the marginals of the ^^ 's are identical. Other and more severe sources of 
unidentifiability could occur as in Example [6j 

Example 7. (Example |6] continued) If we continue our assessment of the latent class model, with two 



variables with two modalities each, based on subset of data extracted from Stouffer and Toby (19511, 
under a Beta, B{a,b), prior distribution on p the posterior distribution is the weighted sum of Beta 
B{ni + a, n — ni + b) distributions, with weights 

22 , 
/i„(n,s) Y[ Yl T{sj, + l/2)r(n, - Sj„ + 1/2) /m! (n - m)! (n + 1)! , 

j = l v = l ' 

where ^„(n, s) denotes the number of occurrences of the sufficient statistic. Figure |3] provides the 



posterior distribution for a subsample of the dataset of Stouffer and Toby (1951) and a = b = 1. Since 
p is not identifiable, the impact of the prior distribution is stronger than in an identifying setting: using 
a Beta B{a, b) prior on p thus produces a posterior [distribution] that reflects as much the influence of 
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Figure 3: Exact posterior distribution of p for a sample of 50 observations from the dataset of StoufFer 
and Toby] (|1951[) and a = = 1. 



(a, b) as the information contained in the data. While a B{1, 1) prior, as in Figurejs] leads to a perfectly 
symmetric posterior with three modes, using an assymetric prior with a <C 6 strongly modifies the range 
of the posterior, as illustrated by Figure |4] 

Identifiability problems resulting from the exchangeability issue are called "label switching" in that the 
output of a properly converging MCMC algorithm should produce no information about the component 
labels (a feature which, incidentally, provides a fast assessment of the performance of MCMC solutions, 
as proposed in Celeux et al. 2000 1 . A nai've answer to the problem proposed in the early literature 



is to impose an identifiability constraint on the parameters, for instance by ordering the means (or the 
variances or the weights) in a normal mixture. From a Bayesian point of view, this amounts to truncating 
the original prior distribution, going from tt (0, p) to 



^tll<...<fJ.J 



While this device may seem innocuous (because indeed the sampling distribution is the same with or 
without this constraint on the parameter space), it is not without consequences on the resulting inference. 
This can be seen directly on the posterior surface: if the parameter space is reduced to its constrained 
part, there is no agreement between the above notation and the topology of this surface. Therefore, 
rather than selecting a single posterior mode and its neighbourhood, the constrained parameter space 
will most likely include parts of several modal regions. Thus, the resulting posterior mean may well end 
up in a very low probability region and be unrepresentative of the estimated distribution. 

Note that, once an MCMC sample has been simulated from an unconstrained posterior distribution, 
any ordering constraint can be imposed on this sample, that is, after the simulations have been completed. 



for estimation purposes as stressed by Stephens ( 1997) . Therefore, the simulation (if not the estimation) 
hindrance created by the constraint can be completely bypassed. 

Once an MCMC sample has been simulated from an unconstrained posterior distribution, a natural 
solution is to identify one of the J! modal regions of the posterior distribution and to operate the 



relabelling in terms of proximity to this region, as in Marin et al. (20051. Similar approaches based on 



clustering algorithms for the parameter sample are proposed in Stephens (19971 and Celeux et al. (2000 1, 



and they achieve some measure of success on the examples for which they have been tested. 



3.2 Restrictions on priors 

From a Bayesian point of view, the fact that few or no observation in the sample is (may be) generated 
from a given component has a direct and important drawback: this prohibits the use of independent 
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Figure 4: Exact posterior distributions of p for a sample of 50 observations from the dataset of [StoufFer] 
and Tobyl (p^5T| under Beta B{a, b) priors when a = .01, .05, .1, .05, 1 and 6 = 100, 50, 20, 10, 5, 1. 



improper priors, 



since, if 



Tr{Oj)dOj — oo 



then for any sample size n and any sample x, 



tt{9, p|x)d0dp — oo . 



The ban on using improper priors can be considered by some as being of little importance, since 
proper priors with large variances could be used instead. However, since mixtures are ill-posed problems, 
this difficulty with improper priors is more of an issue, given that the influence of a particular proper 
prior, no matter how large its variance, cannot be truly assessed. 

There exists, nonetheless, a possibility of using improper priors in this setting, as demonstrated for 



instance by Mengersen and Robert ( 1996 1, by adding some degree of dependence between the component 



parameters. In fact, a Bayesian perspective makes it quite easy to argue against independence in mixture 
models, since the components are only properly defined in terms of one another. For the very reason 
that exchangeable priors lead to identical marginal posteriors on all components, the relevant priors 
must contain some degree of information that components are different and those priors must be explicit 
about this difference. 



The proposal of Mengersen and Robert (19961, also described in Marin et al. (20051, is to introduce 



first a common reference, namely a scale, location, or location-scale parameter (/x,t), and then to define 
the original parameters in terms of departure from those references. Under some conditions on the 



reparameterisation, expressed in Robert and Titterington (19981, this representation allows for the use 



of an improper prior on the reference parameter {h,t). See Wasserman (2000), Perez and Berger (20021, 
Moreno and Liseo ( 2003 1 for different approaches to the use of default or non-informative priors in the 



setting of mixtures. 



4 Inference for mixtures with a known number of components 

In this section, we describe different Monte Carlo algorithms that are customarily used for the approx- 
imation of posterior distributions in mixture settings when the number of components J is known. We 
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start in Section 4.1 with a proposed solution to the label-switching problem and then discuss in the 
following sections Gibbs sampling and Metropolis-Hastings algorithms, acknowledging that a diversity 
of other algorithms exist (tempering, population Monte Carlo...), see Robert and Casella (2004). 



4.1 Reordering 



Section |3.1| discussed the drawbacks of imposing identifiability ordering constraints on the parameter 
space for estimation performances and there are similar drawbacks on the computational side, since 
those constraints decrease the explorative abilities of a sampler and, in the most extreme cases, may 
even prevent the sampler from converging (see Celeux et al. 2000). We thus consider samplers that 



evolve in an unconstrained parameter space, with the specific feature that the posterior surface has a 
number of modes that is a multiple of J!. Assuming that this surface is properly visited by the sampler 
(and this is not a trivial assumption), the derivation of point estimates of the parameters of ([T|) follows 
from an ex-post reordering proposed by Marin et al. (2005) which we describe below. 



Given a simulated sample of size M, a starting value for a point estimate is the naive approximation 
to the Maximum a Posteriori (MAP) estimator, that is the value in the sequence (6>,p)(') that maximises 
the posterior, 

Z* = arg max 7r((0, p)*^'^ |x) 
;=i,...,j\/ 

Once an approximated MAP is computed, it is then possible to reorder all terms in the sequence (0,p)(') 
by selecting the reordering that is the closest to the approximate MAP estimator for a specific distance in 
the parameter space. This solution bypasses the identifiability problem without requiring a preliminary 
and most likely unnatural ordering with respect to one of the parameters (mean, weight, variance) of 
the model. Then, after the reordering step, an estimation of 6j is given by 

M 



1=1 



4.2 Data augmentation and Gibbs sampling approximations 



The Gibbs sampler 


is the most commonly used approach in Bayesian mixture estimation ( 


Diebolt and 


Robert 


1990 


1994 


Lavine and West 


1992[ jVerdinelli and Wasserman 


1992 


Escobar and West 


1995) 


because it takes advantage of the missing data structure of the z^'s uncovered 


in Section 2.2 





ulation of z, p and 9 conditional on one another and on the data, using the full conditional distributions 
derived from the conjugate structure of the complete model. (Note that p only depends on the missing 
data z.) 



Gibbs sampling for mixture models 

0. Initialization: choose p'*'' and arbitrarily 

1. Step t. For i = 1, . . . 

1.1 Generate z'p (i = 1, . . . ,n) from {j ^ 1, . . . , J) 



4*' = jbi* ^K^^! 1 oc 



1.2 Generate p*-*' from 7r(p|z(*^) 

1.3 Generate 0^*' from 7r(0|z(*),x). 



As always with mixtures, the convergence of this MCMC algorithm is not as easy to assess as it 
seems at first sight. In fact, while the chain is uniformly geometrically ergodic from a theoretical point of 
view, the severe augmentation in the dimension of the chain brought by the completion stage may induce 
strong convergence problems. The very nature of Gibbs sampling may lead to "trapping states" , that is, 
concentrated local modes that require an enormous number of iterations to escape from. For example. 
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components with a small number of allocated observations and very small variance become so tightly 
concentrated that there is very little probability of moving observations in or out of those components, 



as shown in Marin et al. (20051. As discussed in Section 2.3 Celeux et al. (20001 show that most MCMC 



samplers for mixtures, including the Gibbs sampler, fail to reproduce the permutation invariance of the 
posterior distribution, that is, that they do not visit the J! replications of a given mode. 

Example 8. Consider a mixture of normal distributions with common variance and unknown means 
and weights 



J 

E 



This model is a particular case of model ([T|) and is not identifiable. Using conjugate exchangeable priors 



p~I?(l,...,l), /ij— AA(0,10a2), 
it is straightforward to implement the above Gibbs sampler: 
• the weight vector p is simulated as the Dirichlet variable 



'£xp{l/2) 



the inverse variance as the Gamma variable 



g{{n + 2)/2,{l/2) 



E 



oT + ^^ 



n 



• and, conditionally on cr, the means fij are simulated as the Gaussian variable 



Af{njXj/{nj + 0.1), crV("j + 0.1)) : 



where rij — J2z = 



and 



Note that this choice of implementation allows for the block simulation of the means-variance group, 
rather than the more standard simulation of the means conditional on the variance and of the variance 
conditional on the means (as in Diebolt and Robert [1994 1. M 
Consider the benchmark dataset of the galaxy radial speeds described for instance in |Roeder and| 
Wasserman (19971. The output of the Gibbs sampler is summarised on Figure [s] in the case of J = 3 
components. As is obvious from the comparison of the three first histograms (and of the three following 
ones), label switching does not occur with this sampler: the three components remain isolated during 
the simulation process. -4 



Note that Geweke ( 2007 ) (among others) dispute the relevance of asking for proper mixing over the 
k\ modes, arguing that on the contrary the fact that the Gibbs sampler sticks to a single mode allows for 
an easier inference. We obviously disagree with this perspective: first, from an algorithmic point of view, 
given the unconstrained posterior distribution as the target, a sampler that fails to explore all modes 
clearly fails to converge. Second, the idea that being restricted to a single mode provides a proper repre- 
sentation of the posterior is naively based on an intuition derived from mixtures with few components. 
As the number of components increases, modes on the posterior surface get inextricably mixed and a 
standard MCMC chain cannot be garanteed to remain within a single modal region. Furthermore, it is 
impossible to check in practice whether not this is the case. 



In his defence of "simple" MCMC strategies supplemented with postprocessing steps, Geweke ( 2007 ) 
states that 

[Celeux et al.'s (2000)] argument is persuasive only to the extent that there are mix- 
ing problems beyond those arising from permutation invariance of the posterior distribution. 
Celeux et al. (2000) does not make this argument, indeed stating "The main defect of the 
Gibbs sampler from our perspective is the ultimate attraction of the local modes" (p. 959). 
That article produces no evidence of additional mixing problems in its examples, and we are 
not aware of such examples in the related literature. Indeed, the simplicity of the posterior 
distributions conditional on state assignments in most mixture models leads one to expect no 
irregularities of this kind. 
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Figure 5: From the left to the right, histograms of the parameters (pi,p2,P3, /^i, /i2, /^s, cr) of a normal 
mixture with k = 3 components based on 10* iterations of the Gibbs sampler and the galaxy dataset, 
evolution of the a and of the log- likelihood. 



There are however clear irregularities in the convergence behaviour of Gibbs and Metropolis-Hastings 



algorithms as exhibited in Marin et al. ( 2005 ) and Marin and Robert ( 2007 1 (Figure 6.4) for an identifiable 
two-component normal mixture with both means unknown. In examples as such as those, there exist 
secondary modes that may have much lower posterior values than the modes of interest but that are 
nonetheless too attractive for the Gibbs sampler to visit other modes. In such cases, the posterior 
inference derived from the MCMC output is plainly incoherent. (See also lacobucci et al.| ( |2008 1 for 
another illustration of a multimodal posterior distribution in an identifiable mixture setting.) 

However, as shown by the example below, for identifiable mixture models, there is no label switching 
to expect and the Gibbs sampler may work quite well. While there is no foolproof approach to check 



MCMC convergence (Robert and Casella 20041, we recommend using the visited likelihoods to detect 



lack of mixing in the algorithms. This does not detect the label switching difficulties (but individual 
histograms do) but rather the possible trapping of a secondary mode or simply the slow exploration of 
the posterior surface. This is particularly helpful when implementing multiple runs in parallel. 

Example 9. (Example |2] continued) Consider the case of a mixture of Student's t distributions with 
known and different numbers of degrees of freedom 



i=i 

This mixture model is not a particular case of model ([ij and is identifiable. Moreover, since the noncentral 
t distribution t„{fi, ti^) can be interpreted as a continuous mixture of normal distributions with a common 
mean and with variances distributed as scaled inverse random variable, a Gibbs sampler can be easily 
implemented in this setting by taking advantage of the corresponding latent variables: Xi ^ t^{^, cr^) is 
the marginal of 

Once these latent variables are included in the simulation, the conditional posterior distributions of all 
parameters are available when using conjugate priors like 



I?(l,...,l), AA(Mo,2a2)^ 
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The full conditionals for the Gibbs sampler are a Dirichlet + ni, . . . , 1 + nj) distribution on the 
weight vector, inverse Gamma 



distributions on the variances ct|, normal 



distributions on the means /ij, and inverse Gamma 




' 2 J 

distributions on the Vi. 

In order to illustrate the performance of the algorithm, we simulated 2,000 observations from the 
two-component t mixture with = 0, /i2 = 5, cr^ = = 1, vi = b, V2 — 11 and pi = 0.3. The output 
of the Gibbs sampler is summarized in Figure [6j The mixing behaviour of the Gibbs chains seems to be 
excellent, as they explore neighbourhoods of the true values. < 



-0.4 -0.2 



40 I ' ' 

1 1.5 ° 0.25 0.3 

r 



Figure 6: Histograms of the parameters, /ii, (Ti,pi, /X2, cr2, and evolution of the (observed) log- likelihood 
along 30, 000 iterations of the Gibbs sampler and a sample of 2, 000 observations. 

The example below shows that, for specific models and a small number of components, the Gibbs 
sampler may recover the symmetry of the target distribution. 

Example 10. (Exa mple [6] continued) For the latent class model, if we use all four variables with 
two modalities each inlStouffer and Toby (19511, the Gibbs sampler involves two steps: the completion 



of the data with the component labels, and the simulation of the probabilities p and qtj from Beta 
{B)(stj + .5, Uj — Stj + .5) conditional distributions. For the 216 observations, the Gibbs sampler seems 
to converge satisfactorily since the output in Figure [7] exhibits the perfect symmetry predicted by the 
theory. We can note that, in this special case, the modes are well separated, and hence values can be 
crudely estimated for qij by a simple graphical identification of the modes. 



4.3 Metropolis— Hastings approximations 

The Gibbs sampler may fail to escape the attraction of a local mode, even in a well-behaved case as in 
Example 1 where the likelihood and the posterior distributions are bounded and where the parameters 
are identifiable. Part of the difficulty is due to the completion scheme that increases the dimension 
of the simulation space and that reduces considerably the mobility of the parameter chain. A standard 
alternative that does not require completion and an increase in the dimension is the Metropolis-Hastings 
algorithm. In fact, the likelihood of mixture models is available in closed form, being computable in 0( Jn) 
time, and the posterior distribution is thus available up to a multiplicative constant. 
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Figure 7: Latent class model: histograms of p and of the gt^'s for 10'* iterations of the Gibbs sampler 
and the four variables of Stouffer and Toby (1951). The first histogram corresponds to p, the next on 
the right to qn, followed by 521 (identical), then q2i, (722, and so on. 



General Metropolis Hastings algorithm for mixture models 

0. Initialization. Choose p(°) and 0*^°^ 

1. Step t. For i = 1, . . . 

1.1 Generate (0,p) from q (e,p\e^*-^'> ,p^^'^y 



1.2 Compute 

/(x|0,pV(0,p)g(0(*-i),p(*-i)|0,p) 



/(x|6>(*-i\p(*-i))^(0(*-*),p(*-i))g(6>,p|6/(*-i),p(*-i))' 

1.3 Generate u ^ ^[0.1] 

If r > u then (S^'^pW) = (0,p) 
else (6/(*\p(*)) = (6>(*-i\p(*-i)). 

The major difference with the Gibbs sampler is that we need to choose the proposal distribution q, 
which can be a priori anything, and this is a mixed blessing! The most generic proposal is the random 
walk Metropolis-Hastings algorithm where each unconstrained parameter is the mean of the proposal 
distribution for the new value, that is, 

where uj ^ A/'(0, C^). However, for constrained parameters like the weights and the variances in a normal 
mixture model, this proposal is not efficient. 

This is indeed the case for the parameter p, due to the constraint that Yli^^iPj = 1- To solve this 



difficulty, Cappe et al. ( 2003 1 propose to overparameterise the model ([T]) as 



' 1=1 

thus removing the simulation constraint on the Pj's. Obviously, the Wj's are not identifiable, but this 
is not a difficulty from a simulation point of view and the pj's remain identifiable (up to a permutation 
of indices) . Perhaps paradoxically, using overparameterised representations often helps with the mixing 
of the corresponding MCMC algorithms since they are less constrained by the dataset or the likelihood. 
The proposed move on the Wj's is \og{wj) — log(u;j*^^'') + Uj where Uj ~ A/'(0, C^). 
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Example 11. (Example [2] continued) We now consider the more realistic case when the degrees of 
freedom of the t distributions are unknown. The Gibbs sampler cannot be implemented as such given 
that the distribution of the z/j's is far from standard. A common alternative ([Robe rt and Casella||2004 ) 
is to introduce a Metropolis step within the Gibbs sampler to overcome this difficulty. If we use the same 
Gamma prior distribution with hyperparameters {q:^,(3^) for all the i^jS, the full conditional density of 
v-j is 



Tr{vj |V, z) cx 



r(;.,/2) 



n 

Zi=j 



-(^,72+1) 



Therefore, we resort to a random walk proposal on the log(^'j)'s with scale cr = 5. (The hyperparameters 
are a,^ = h and Pi, — 2.) 

In order to illustrate the performances of the algorithm, two cases are considered: (i) all parameters 
except variances (cr^ = ct| = 1) are unknown and (ii) all parameters are unknown. For a simulated 
dataset, the results are given on Figure [8] and Figure |9] respectively. In both cases, the posterior distri- 
butions of the Vj^s exhibit very large variances, which indicates that the data is very weakly informative 
about the degrees of freedom. The Gibbs sampler does not mix well-enough to recover the symmetry in 
the marginal approximations. The comparison between the estimated densities for both cases with the 



setting is given in Figure 10 The estimated mixture densities are indistinguishable and the fit to the 
simulated dataset is quite adequate. Clearly, the corresponding Gibbs samplers have recovered correctly 
one and only one of the 2 symetric modes. 



I] 




0.25 0.3 0.35 




Figure 8: Histograms of the parameters /ii, z^i,pi, /i2, when the variance parameters are known, and 
evolution of the log-likelihood for a simulated t mixture with 2,000 points, based on 3 x 10* MCMC 
iterations. 




1\ 



0.5 0.5 



20 0.2 



■ ■ ■ 1 15 I ■ 1 0.4 I ■ 1 

AA :LU :Li_ 




Figure 9: Histograms of the parameters /ii, CTi, z^i,pi, /i2, o'2i t^2i and evolution of the log-likelihood for a 
simulated t mixture with 2, 000 points, based on 3 x lO'* MCMC iterations. 
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Figure 10: Histogram of the simulated dataset, compared with estimated t mixtures with known 
(red), known v (green), and when all parameters are unknown (blue). 





Ml 


M2 


CTl 


cr2 


1^1 


1^2 


Pi 


Student 


2.5624 


3.9918 


0.5795 


0.3595 


18.5736 


19.3001 


0.3336 


Normal 


2.5729 


3.9680 


0.6004 


0.3704 






0.3391 



Table 2: Estimates of the parameters for the aerosol dataset compared for t and normal mixtures. 



We now consider the aerosol particle dataset described in Example |2] We use the same prior distri- 
butions on the Vj^s as before, that is Q{5, 2). Figure [IT] summarises the output of the MCMC algorithm. 
Since there is no label switching and only two components, we choose to estimate the parameters by 
the empirical averages, as illustrated in Table |2] As shown by Figure [2] both t mixtures and normal 
mixtures fit the aerosol data reasonably well. -4 
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Figure 11: Histograms of parameters {fii, ai, vi,pi, /ii, CT2, 1^2) and log-likelihood of a mixture of t distri- 
butions based on 30,000 iterations and the aerosol data. 



5 Inference for mixture models with an unknown number of 
components 

Estimation of J, the number of components in ([l]), is a special type of model choice problem, for which 
there is a number of possible solutions: 



(i) direct computation of the Bayes factors ( Kass and Raftery]|1995 Chib||1995 ); 
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(ii) evaluation of an entropy distance ( Mengersen and Robert|[T996l |Sahu and Cheng|p003| ; 



(iii) generation from a joint distribution across models via reversible jump MCMC (Richardson and 
Green|p97 l or via birth-and-death processes ( |Stephens|[2000| ; 



depending on whether the perspective is on testing or estimation. We refer to Marin et al. (20051 for 



and Casella 


(2004 


and a 


Marin and Robert 


(20071 



not generated as much follow-up, except for Cappe et al. ( 2003 1 who showed that the essential mechanism 



in this approach was the same as with reversible jump MCMC algorithms. 

We focus here on the first two approaches, because, first, the description of reversible jump MCMC 
algorithms require much care and therefore more space than we can allow to this paper and, second, this 
description exemplifies recent advances in the derivation of Bayes factors. These solutions pertain more 
strongly to the testing perspective, the entropy distance approach being based on the KuUback-Leibler 
divergence between a J component mixture and its projection on the set of J — 1 mixtures, in the same 



spirit as in Dupuis and Robert (2003). Given that the calibration of the Kullback divergence is open to 
various interpretations (Mengersen and Robert|1996l|Goutis and Robert|1998| [Dupuis and Robert|2003| , 

we will only cover here some proposals regarding approximations of the Bayes factor oriented towards 
the direct exploitation of outputs from single model MCMC runs. 

In fact, the major difference between approximations of Bayes factors based on those outputs and 
approximations based on the output from the reversible jump chains is that the latter requires a suf- 
ficiently efficient choice of proposals to move around models, which can be difficult despite significant 
recent advances ( Brooks et al.|200"3 l. If we can instead concentrate the simulation effort on single models, 
the complexity of the algorithm decreases (a lot) and there exist ways to evaluate the performance of the 
corresponding MCMC samples. In addition, it is often the case that few models are in competition when 
estimating J and it is therefore possible to visit the whole range of potentials models in an exhaustive 
manner. 

We have 

n J 

j=i 



where Aj = {0,p) = {9i, . . . ,9j,pi, . . . ,pj)- Most solutions (see, e.g. 'Friihwirth- Schnatter||2006 Section 
5.4) revolve around an importance sampling approximation to the marginal likelihood integral 



m,j{x) 



J /j(x|Aj)7rj(Aj)dA, 



where J denotes the model index (that is the number of components in the present case). For instance. 



Liang and Wong (20011 use bridge sampling with simulated annealing scenarios to overcome the label 



switching problem. Steele et al. (2006) rely on defensive sampling and the use of conjugate priors 
to reduce the integration to the space of latent variables (as i n ^Casella et al. 2004| with an iterative 
construction of the importance function. Friihwirth-Schnatter (20041 also centers her approximation of 



the marginal likelihood on a bridge sampling strategy, with particular attention paid to identifiability 
constraints. A different possibility is to use (Gelfand and Dey ( 1994 1 representation: starting from an 
arbitrary density gj, the equality 



1 = j 9j{\j)d\j^ J 



= "ij(x) 



/j(x|Aj)7rj(Aj) 
9j{\j) 



/j(x|Aj)^j(Aj)dAj 



fj{-^\\j)^j{Xj) 
implies that a potential estimate of mj(x) is 



7rj(Aj|x)dAj 



mj(x) 



1/^ 



Tf^/,(x|A(;))vr,(A(*)) 



20 



when the Aj-^'s are produced by a Monte Carlo or an MCMC sampler targeted at 7rj(Aj|x). While this 



solution can be easily implemented in low dimensional settings (iChopin and Robert 2007[ ), calibrating 
the auxiliary density is always an issue. The auxiliary density could be selected as a non-parametric 
estimate of 7rfc(Aj|a;) based on the sample itself but this is very costly. Another difficulty is that the 
estimate may have an infinite variance and thus be too variable to be trustworthy, as experimented by 
[Fruhwirth-Schnatter] f2004[ ). 

Yet another approximation to the integral mj(x) is to consider it as the expectation of /j(x| Aj), when 
A J is distributed from the prior. While a brute force approach simulating Aj from the prior distribution 
is requiring a huge number of simulations (Neal 19991, a Riemann based alternative is proposed by 



Skilling (20061 under the denomination of nested sampling] however, Chopin and Robert (2007) have 



shown in the case of mixtures that this technique could lead to uncertainties about the quality of the 
approximation . 



We consider here a further solution, first proposed by Chib (19951, that is straightforward to imple- 
ment in the setting of mixtures (see Chib and JeUazkov 2001 for extensions). Although it came under 



criticism by Neal (19991 (see also |Fruhwirth-Schnatter||2004 1 , we show below how the drawback pointed 
by the latter can easily be removed. Chib's ( 1995 1 method is directly based on the expression of the 



marginal distribution (loosely called marginal likelihood in this section) in Bayes' theorem: 



mj(x) = 



/j(x|Aj)7rj(Aj) 
7rj(A,/|x) 



and on the property that the rhs of this equation is constant in Aj. Therefore, if an arbitrary value of 
Aj, Aj say, is selected and if a good approximation to 7rj(Aj|x) can be constructed, 7rj(Aj|x), Chib's 



( 1995 ) approximation to the marginal likelihood is 



m,7(x) 



/j(x|A})7rj(A}) 



In the case of mixtures, a natural approximation to 7r,/(Aj|x) is the Rao-Blackwell estimate 

'^./(AS|x) = ^f]^j(AS|x,z(*)), 



(8) 



where the z'^*^'s are the latent variables simulated by the MCMC sampler. To be efficient, this method 
requires 

(a) a good choice of Aj but, since in the case of mixtures, the likelihood is computable, Aj can be 
chosen as the MCMC approximation to the MAP estimator and, 

(b) a good approximation to 7rj(Aj|x). 



This later requirement is the core of Neal's (1999) criticism: while, at a formal level, 7r,/(A}|x) is a 



converging (parametric) approximation to 7rj(Aj|x) by virtue of the ergodic theorem, this obviously re- 
quires the chain (z^*^) to converge to its stationarity distribution. Unfortunately, as discussed previously, 
in the case of mixtures, the Gibbs sampler rarely converges because of the label switching phenomenon 



described in Section 3.1 so the approximation 7rj(Aj|x) is untrustworthy. Neal (19991 demonstrated via 
a numerical experiment that ([s]) is significantly different from the true value toj(x) when label switching 
does not occur. There is, however, a fix to this problem, also explored by Berkhof et al. (2003 1, which is 



to recover the label switching symmetry a posteriori, replacing 7rj(Aj|x) in (|8|) above with 



1 

TJ! 



51 



it)) 



aee.j t=i 



where &j denotes the set of all permutations of {1,..., J} and cr(A}) denotes the transform of A} 
where components are switched according to the permutation a. Note that the permutation can equally 
be applied to Aj or to the z'*^'s but that the former is usually more efficient from a computational 
point of view given that the sufficient statistics only have to be computed once. The justification for 
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J 


2 


3 


4 


5 


6 


7 8 


/5j(x) 


-115.68 


-103.35 


-102.66 


-101.93 


-102.88 


-105.48 -108.44 



Table 3: Estimations of the marginal likelihoods by the symmetrised Chib's approximation (based on 
10^ Gibbs iterations and, for J > 5, 100 permutations selected at random in 



this modification either stems from a Rao-Blackwellisation argument, namely that the permutations are 
ancillary for the problem and should be integrated out, or follows from the general framework of |Kong] 



et al. ( 2003 1 where symmetries in the dominating measure should be exploited towards the improvement 



of the variance of Monte Carlo estimators. 

Example 12. (Example [s] continued) In the case of the normal mixture case and the galaxy dataset, 
using Gibbs sampling, label switching does not occur. If we compute logm,/(x) using only the original 
estimate of Chib (19951 ([s]), the [logarithm of the] estimated marginal likelihood is /5j(x) = —105.1396 
for J = 3 (based on 10'^ simulations), while introducing the permutations leads to pj{x) = —103.3479. 
As already noted by Neal ( 1999) , the difference between the original Chib's (1995) approximation and the 



true marginal likelihood is close to log( J!) (only) when the Gibbs sampler remains concentrated around 
a single mode of the posterior distribution. In the current case, we have that —116.3747 -I- log(2!) = 
— 115.6816 exactly! (We also checked this numerical value against a brute-force estimate obtained by 
simulating from the prior and averaging the likelihood, up to fourth digit agreement.) A similar result 
holds for J = 3, with -105.1396 log(3!) = -103.3479. Both [Neal] ([19991) and |Frtihwirth-Schnatter 



(2004) also pointed out that the log(J!) difference was unlikely to hold for larger values of J as the 
modes became less separated on the posterior surface and thus the Gibbs sampler was more likely to 
explore incompletely several modes. For J = 4, we get for instance that the original Chib's (1995) 
approximation is —104.1936, while the average over permutations gives —102.6642. Similarly, for J = 5, 
the difference between —103.91 and —101.93 is less than log(5!). The log(J!) difference cannot therefore 



be used as a direct correction for Chib's ( 1995 1 approximation because of this difficulty in controlling 
the amount of overlap. However, it is unnecessary since using the permutation average resolves the 
difficulty. Table |3] shows that the prefered value of J for the galaxy dataset and the current choice of 
prior distribution is J = 5. 

When the number of components J grows too large for all permutations in &j to be considered in 
the average, a (random) subsample of permutations can be simulated to keep the computing time to a 
reasonable level when keeping the identity as one of the permutations, as in Table [3] for J = 6,7. (See 
Berkhof et al.| 2003 for another solution.) Note also that the discrepancy between the original Chib's 



( 1995 ) approximation and the average over permutations is a good indicator of the mixing properties of 
the Markov chain, if a further convergence indicator is requested. 

Example 13. (Exa mple [6] con tinued) For instance, in the setting of Example [6] with a = & = 1, both 

3|( |1995| ' ' " ' 



the approximation of |Chib| ( 1995[ ) and the symmetrized one are identical. When comparing a single class 
model with a two class model, the corresponding (log-) marginals are 



pi(x)=n 



r(l) T{n, + l/2)T{n-ni + l/2) 



and /52(x) 



jr(i/2)2 r(n + i) 

-523.2978, giving a clear preference to the two class model. 



= -552.0402 
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